Abstract 

Background: Randomised controlled trials (RCTs) are widely accepted as the preferred study design for evaluating 
healthcare interventions. When the sample size is determined, a (target) difference is typically specified that the RCT is 
designed to detect. This provides reassurance that the study will be informative, i.e., should such a difference exist, it is likely 
to be detected with the required statistical precision. The aim of this review was to identify potential methods for specifying 
the target difference in an RCT sample size calculation. 

Methods and Findings: A comprehensive systematic review of medical and non-medical literature was carried out for 
methods that could be used to specify the target difference for an RCT sample size calculation. The databases searched 
were MEDLINE, MEDLINE In-Process, EMBASE, the Cochrane Central Register of Controlled Trials, the Cochrane Methodology 
Register, PsycINFO, Science Citation Index, EconLit, the Education Resources Information Center (ERIC), and Scopus (for in- 
press publications); the search period was from 1966 or the earliest date covered, to between November 2010 and January 
2011. Textbooks addressing the methodology of clinical trials and International Conference on Harmonisation 
of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) tripartite guidelines for clinical trials 
were also consulted. A narrative synthesis of methods was produced. Studies that described a method that could be used 
for specifying an important and/or realistic difference were included. The search identified 11,485 potentially relevant 
articles from the databases searched. Of these, 1,434 were selected for full-text assessment, and a further nine were 
identified from other sources. Fifteen clinical trial textbooks and the ICH tripartite guidelines were also reviewed. In total, 
777 studies were included, and within them, seven methods were identified — anchor, distribution, health economic, 
opinion-seeking, pilot study, review of the evidence base, and standardised effect size. 

Conclusions: A variety of methods are available that researchers can use for specifying the target difference in an RCT 
sample size calculation. Appropriate methods may vary depending on the aim (e.g., specifying an important difference 
versus a realistic difference), context (e.g., research question and availability of data), and underlying framework adopted 
(e.g., Bayesian versus conventional statistical approach). Guidance on the use of each method is given. No single method 
provides a perfect solution for all contexts. 

Please see later in the article for the Editors' Summary. 
Introduction 

A randomised controlled trial (RCT) is widely regarded as the 
preferred study design for comparing the effectiveness of health 
interventions [1]. Central to the design and validity of an RCT is a 
calculation of the number of participants needed: the sample size. 
This provides reassurance that the study will be informative. Using 
the Neyman-Pearson method (a conventional approach to sample 
size calculation), a (target) difference that the RCT is designed to 
detect is typically specified. 

Selecting an appropriate target difference is critical. If too small 
a target difference is specified, the trial may be a wasteful and 
unethical use of data and resources. If too large a target difference 
is hypothesized, there is a risk that a clinically relevant difference 
will be overlooked because the study is too small. Both extremes 
could therefore have a detrimental impact on decision-making [2]. 
Additionally, through its impact on sample size, the choice of 
target difference has substantial implications in terms of study 
conduct and associated cost. 
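
The sensitivity of the sample size to the target difference can be illustrated with a minimal sketch. It assumes a two-arm parallel-group trial comparing means with equal allocation and uses the usual normal-approximation formula; the function name `n_per_group`, the two-sided 5% significance level, the 90% power, and the SD of 20 are illustrative assumptions rather than values taken from this review.

```python
import math
from statistics import NormalDist


def n_per_group(target_diff, sd, alpha=0.05, power=0.90):
    """Approximate participants per arm for a two-arm parallel trial
    comparing means (normal approximation, two-sided test, 1:1 allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * sd / target_diff) ** 2
    return math.ceil(n)


# Hypothetical outcome with SD = 20: each halving of the target
# difference roughly quadruples the number needed per arm.
for diff in (10.0, 5.0, 2.5):
    print(diff, n_per_group(diff, sd=20.0))
```

Under these assumptions a small target difference quickly leads to a very large trial, while a large one yields a small trial that may miss a smaller but still relevant difference.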

However, unlike the statistical considerations involved in sample 
size calculation, research on how to specify the target difference 
has been greatly neglected, with no substantive guidance available 
[3,4]. While a variety of potential approaches have been proposed, 
such as specifying what an important difference would be (e.g., the 
"minimal clinically important difference") or what a realistic 
difference would be given the results of previous studies, the 
current state of the evidence base is unclear. Although some 
reviews of different types of methods have been conducted [2,5], 
there is still a need for a comprehensive review of available 
methods. The aim of this systematic review was to identify 
potential methods for specifying the target difference in an RCT 
sample size calculation, whether addressing an important differ- 
ence (a difference viewed as important by a relevant stakeholder 
group [e.g., clinicians]) and/or realistic difference (a difference 
that can be considered to be realistic given the interventions to be 
evaluated). The methods are described, and guidance offered on 
their use. 

Methods 

A comprehensive search of both biomedical and selected non- 
biomedical databases was undertaken. Search strategies and 
databases searched were informed by preliminary scoping work. 
The final databases searched were MEDLINE, MEDLINE In- 
Process, EMBASE, the Cochrane Central Register of Controlled 
Trials, the Cochrane Methodology Register, PsycINFO, Science 
Citation Index, EconLit, Education Resources Information Center 
(ERIC), and Scopus (for in-press publications) from 1966 or 
earliest date coverage; the searches were undertaken between 
November 2010 and January 2011. Given the magnitude of the 
literature identified by this initial search and the belief that 
updating the search would not lead to additional approaches to 
specifying the target difference, an update of this search was not 
carried out. There was no language restriction. It was anticipated 
that reporting of methods in the titles and abstracts would be of 
variable quality and that therefore a reliance on indexing and text 
word searching would be inadvisable. Consequently, several other 
methods were used to complement the electronic searching and 
included checking of reference lists, citation searching for key 
articles using Scopus and Web of Science, and contacting experts 
in the field. The protocol and details of the search strategies used 
are available in Protocol S1 and Search Strategy S1. 

Additionally, textbooks covering methodological aspects of 
clinical trials were consulted. These textbooks were identified by 
searching the integrated catalogue of the British Library and the 
catalogues (for the most recent 5 y) of several prominent publishers 
of statistical texts. The project steering group was also asked to 
suggest key clinical trial textbooks that could be assessed. Because 
of the nature of the review, ethical approval was unnecessary. 

To be included in this review, each study had to report a formal 
method that had been used or could be used to specify a target 
difference. Any study design for original research was eligible, 
provided its assessment was based on at least one outcome of 
relevance to a clinical trial. Studies were excluded only if they were 
reviews, failed to report a method for specifying a target difference, 
reported only on statistical sample size considerations rather than 
clinical relevance, or assessed an outcome measure (e.g., number 
needed to treat) without reference to how a difference could be 
determined. 

Potentially relevant titles and abstracts were screened by either 
or both of two reviewers (J. H. or T. G.), with any uncertainties or 
disagreements discussed with a third party (J. A. C.). Full-text 
articles were obtained for the titles and abstracts identified as 
potentially relevant. These were provisionally categorised accord- 
ing to method of specifying the target difference (if detailed in the 
abstract). One of four reviewers (J. H., T. G., K. H., or T. E. A.) 
screened the full-text articles and extracted information, after 
having screened and extracted information from a practice sample 
of articles and compared results to ensure consistency in the 
screening process. Where there was uncertainty regarding whether 
or not a study should be included for data extraction, the opinion 
of a third party (J. A. C.) was sought, and the study discussed until 
consensus was reached. 

Data were extracted on the methodological details and any 
noteworthy features such as unique variations not found in other 
studies reporting the same method. Specific information relevant 
to each particular method was recorded, and no generic data 
extraction form was used across all methods. It was felt that a 
generic data extraction form that included all fields of relevance to 
all methods would be too cumbersome, because the methods 
varied in conception and implementation. 

Narrative descriptions of each method were produced, summa- 
rising the key characteristics based on extracted data on the 
similarities and differences in each application of the same 
method, frequency with which each variant of the method was 
used, and strengths and weaknesses of the method, either 
identified by the review team as potentially important, or extracted 
from study authors' own points about the strengths and limitations 
of their method (or methods) as reported in the articles. Methods 
were assessed according to criteria developed by the steering group 
prior to undertaking the evidence synthesis; the criteria covered 
the validity, implementation, statistical properties, and applicabil- 
ity of each method. The initial assessment was carried out by J. A. 
C. and revised by the steering group. 
11,485 titles and abstracts identified from primary search: 

  MEDLINE/MEDLINE In-Process/EMBASE: 7,189 
  Central: 487 
  SCI: 1,898 
  EconLit: 255 
  PsycINFO: 1,367 
  CMR: 10 
  ERIC: 158 
  Scopus: 121 
  TOTAL: 11,485 

10,051 excluded at abstract screening stage 

1,434 articles selected for full-text assessment 

667 articles excluded: 

  595 did not report a method (report of a clinical study, no methodology regarding choice of method provided, or statistical sample size/metric methods paper) 
  67 reference not traceable 
  5 review article of methods (references searched) 

10 studies identified from other sources 

777 studies included: 

  253 anchor method only 
  13 health economic method only 
  171 distribution method only 
  60 opinion-seeking method only 
  5 pilot study method only 
  22 review of the evidence base method only 
  37 standardised effect size method only 
  216 using more than one method* 

Figure 1. PRISMA flow diagram. *For a breakdown of studies that used more than one method in combination, please see Table 1. Central, 
Cochrane Central Register of Controlled Trials; CMR, Cochrane Methodology Register; ERIC, Education Resources Information Center; SCI, Science 
Citation Index. 

doi:10.1371/journal.pmed.1001645.g001 



Results 

We identified 11,485 potentially relevant studies from the 
databases searched. The number of studies found within each 
database is detailed in Figure 1 (PRISMA flow diagram), which also shows 
the number of included studies for each method. 
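
The counts reported in Figure 1 can be cross-checked with simple arithmetic. The sketch below transcribes the figure's numbers and confirms that the stages of the flow diagram add up; the variable names are illustrative, and the value of 10 for studies from other sources is the one printed in the figure.

```python
# Counts transcribed from Figure 1 (PRISMA flow diagram).
database_hits = {
    "MEDLINE/MEDLINE In-Process/EMBASE": 7189,
    "Central": 487,
    "SCI": 1898,
    "EconLit": 255,
    "PsycINFO": 1367,
    "CMR": 10,
    "ERIC": 158,
    "Scopus": 121,
}
excluded_at_abstract = 10051
excluded_at_full_text = 595 + 67 + 5  # no method, not traceable, review article
from_other_sources = 10  # as printed in Figure 1
included_by_method = {
    "anchor only": 253,
    "health economic only": 13,
    "distribution only": 171,
    "opinion-seeking only": 60,
    "pilot study only": 5,
    "review of evidence base only": 22,
    "standardised effect size only": 37,
    "more than one method": 216,
}

identified = sum(database_hits.values())
full_text = identified - excluded_at_abstract
included = full_text - excluded_at_full_text + from_other_sources

print(identified, full_text, included, sum(included_by_method.values()))
```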

Of the potentially relevant studies identified, 1,434 were selected 
for full-text assessment; a further nine were identified from other 
sources. Fifteen clinical trial textbooks and the International 
Conference on Harmonisation of Technical Requirements for 
Registration of Pharmaceuticals for Human Use tripartite guide- 
lines were also reviewed, though none identified a method that had 
not already been identified from the journal database searches. In 
total, 777 studies were included. Seven methods were identified: 
anchor, distribution, health economic, opinion-seeking, pilot study, 
review of the evidence base, and standardised effect size (SES). 
Descriptions of these methods are provided in Box 1. No methods 
were identified by this review beyond those already known to the 
reviewers. The anchor, distribution, opinion-seeking, review of the 
evidence base, and SES methods were used in studies in varied 
clinical and treatment areas, but predominantly in those pertaining 
to chronic diseases. Although the number of included studies for 
both the health economic and pilot study methods was much 
smaller, real or hypothetical trial examples covered pharmacolog- 
ical and non-pharmacological treatments for both acute and 
chronic conditions. 

Substantial variation between studies was found in the way the 
seven methods were implemented. In addition, some studies used 
several methods, although the combinations used varied, as did the 
extent to which results were triangulated. The anchor method was 
Box 1. Methods for Specifying an Important and/or Realistic Difference 



Methods for specifying an important difference 

• Anchor: The outcome of interest can be "anchored" by 
using either a patient's or health professional's judgement 
to define an important difference. This may be achieved by 
comparing a patient's health before and after treatment 
and then linking this change to participants judged to 
have shown improvement/deterioration. Alternatively, a 
more familiar outcome, for which patients or health 
professionals more readily agree on what amount of 
change constitutes an important difference, can be used. 
Alternatively, a contrast between patients can be made to 
determine a meaningful difference. 

• Distribution: Approaches that determine a value based 
upon distributional variation. A common approach is to 
use a value that is larger than the inherent imprecision in 
the measurement and therefore likely to represent a 
minimal level for a meaningful difference. 

• Health economic: Approaches that use principles of 
economic evaluation. These typically include both resource 
cost and health outcomes, and define a threshold value for 
the cost of a unit of health effect that a decision-maker is 
willing to pay, to estimate the overall net benefit of 
treatment. The net benefit can be analysed in a frequentist 
framework or take the form of a (typically Bayesian) 
decision-theoretic value of information analysis. 

• Standardised effect size: The magnitude of the effect 
on a standardised scale defines the value of the difference. 
For a continuous outcome, the standardised difference 
(most commonly expressed as Cohen's d "effect size") can 
be used. Cohen's cutoffs of 0.2, 0.5, and 0.8 for small, 
medium, and large effects, respectively, are often used. 
Thus a "medium" effect corresponds simply to a change in 
the outcome of 0.5 SDs. Binary or survival (time-to-event) 
outcome metrics (e.g., an odds, risk, or hazard ratio) can be 
utilised in a similar manner, though no widely recognised 
cutoffs exist. Cohen's cutoffs approximate odds ratios of 
1.44, 2.48, and 4.27, respectively. Corresponding risk ratio 
values vary according to the control group event 
proportion. 

Methods for specifying a realistic difference 

• Pilot study: A pilot (or preliminary) study may be carried 
out where there is little evidence, or even experience, to 
guide expectations and determine an appropriate target 
difference for the trial. In a similar manner, a Phase 2 study 
could be used to inform a Phase 3 study. 

Methods for specifying an important and/or a 
realistic difference 

• Opinion-seeking: The target difference can be based 
on opinions elicited from health professionals, patients, 
or others. Possible approaches include forming a panel 
of experts, surveying the membership of a professional 
or patient body, or interviewing individuals. This 
elicitation process can be explicitly framed within a trial 
context. 

• Review of evidence base: The target difference can be 
derived using current evidence on the research question. 
Ideally, this would be from a systematic review or meta- 
analysis of RCTs. In the absence of randomised evidence, 
evidence from observational studies could be used in a 
similar manner. An alternative approach is to undertake a 
review of studies in which an important difference was 
determined. 



the most popular, used by 447 studies, of which 194 (43%) used it 
in combination with another method. The distribution method 
was used by 324 studies, of which 153 (47%) used it alongside 
another method. Eighty studies used an opinion-seeking method, 
of which 20 (25%) also used additional methods. Twenty-seven 
studies used a review of the evidence base method, of which five 
(19%) also used another method. Six studies used a pilot study 
method, of which one (17%) also used another method. The SES 
method was used by 166 studies, of which 129 (78%) also used 
another method. Thirteen studies used a health economic method. 

For all methods used in combination with others, Table 1 
provides a breakdown of the variety of combinations identified and 
their frequency. The main variations identified from the systematic 
review for each of the methods are described in Table 2, and are 
further described in the text below. A brief summary of the 
literature for each method, and for studies that used a combination 
of methods, is given below. Table 3 contains an assessment of 
the value of the individual methods. Table 4 contains examples 
and key implementation points for the use of each method. 

Anchor Method 

Implementation of the anchor method varied greatly [6-37]. In 
its most basic form, the anchor method evaluates the minimal 
(clinically) important change in score for a particular instrument. 
This is established by calculating the mean change score 
(post-intervention minus pre-intervention) for that instrument among a 
group of patients for whom another instrument (the "anchor") 
indicates that a minimum clinically important change has 
occurred. The anchor instrument, the number of available points 
on the anchor instrument for response, and the corresponding 
labelling varied between applications. The anchor instrument was 
most often a subjective assessment of improvement (e.g., global 
rating of change), though objective measures of improvement 
could be used (e.g., a 15-letter change in visual acuity as measured 
on the Snellen eye chart) [34]. The anchor instrument was usually 
posed to patients alone [19,35], though in some cases the 
clinicians' views alone were used. Older studies tended to use a 
15-point Likert scale for the anchor instrument, as suggested by 
Jaeschke and colleagues [16]; more recent studies tended to use 
five- or seven-point scales instead. Depending upon the study size 
and/or clinical context, merging of multiple points on the scale 
may be required. For example, if a seven-point scale has been used 
but very few people rate themselves at the extremes of this scale (1 
and 7), it may be possible to merge points 1 and 2 of the scale and 
points 6 and 7. It should be noted that it may not always be 
appropriate to do this, depending on the clinical question under 
consideration. 
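
The basic anchor calculation described above can be sketched as follows. The data are hypothetical, and the seven-point anchor scale, its labelling, and the choice of rating 5 as the "minimally improved" category are assumptions made only for illustration; the comparison with patients reporting no change corresponds to the relative-change variant.

```python
from statistics import mean

# Hypothetical records: (pre-intervention score, post-intervention score,
# anchor rating). Assumed seven-point global rating of change:
# 1 = much worse ... 4 = no change ... 7 = much better.
records = [
    (40, 52, 6), (35, 41, 5), (50, 49, 4), (42, 50, 5),
    (38, 38, 4), (45, 55, 5), (47, 60, 7), (39, 44, 5),
]

MINIMAL_IMPROVEMENT = 5  # assumed rating labelled "a little better"
NO_CHANGE = 4


def mean_change(records, rating):
    """Mean change score (post minus pre) among patients with this rating."""
    changes = [post - pre for pre, post, r in records if r == rating]
    return mean(changes)


important_change = mean_change(records, MINIMAL_IMPROVEMENT)
relative_change = important_change - mean_change(records, NO_CHANGE)
print(important_change, relative_change)
```

With so few patients per anchor category the estimate is unstable, which is one reason adjacent scale points are sometimes merged.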

Relative change can be incorporated by comparing those for 
whom an important change was identified to another patient 
subset (tested using the same instrument and anchor) who reported 
no change over time. Another common variation is to consider the 
percentage change score in the instrument under consideration 
[33], rather than the absolute score change. Determination of 
what constituted an important difference was sometimes based 
Table 1. Use of multiple methods. 

[Table body not recoverable from the source text. Surviving headings: Methods Used in Combination; Distribution; Health Economic; Opinion-Seeking; Pilot Study; Review of Evidence.] 

doi:10.1371/journal.pmed.1001645.t001 



upon the use of methodology more typically used to assess 
diagnostic accuracy, such as receiver operating characteristic 
curves [6,11,20], or more complex statistical approaches. It is 
worth noting that the anchor method was not always successful in 
deriving values for an important difference; failure was usually due 
to either practical or methodological difficulties [17,23]. 

A substantially different way of achieving an anchor-based 
approach for specifying an important difference was proposed by 
Redelmeier and colleagues [28]: in this study, other patients 
formed a reference against which a patient could rate their own 
health (or health improvement) [10,27-30]. Generalisability of the 
resulting estimate of an important difference is a key concern. For 
example, if the disease is chronic and progressive, an important 
change value from a newly diagnosed population may not apply to 
a population with a far longer duration of illness [15,24,25,32,36]. 
A key consideration is how to decide on an appropriate cutoff 
point for the anchor "transition" tool. 

Participant biases, such as recall bias, are also potentially 
problematic [13,14,21,22,25], as are response shift (whereby 
patients' perceptions of acceptable change alter during the course 
of disease or treatment and become inconsistent) [37] and 
gratitude factor or halo bias (whereby responses that are more 
favourable than is realistic need to be taken into account) [31,35]. 
Another key choice is whether to consider improvement and 
deterioration together or separately. If a Likert scale has been used 
as the anchor, improvement and deterioration can be merged to 
obtain one more general measure for "change" by "folding" the 
scale at zero, though this assumes symmetry of effect, with "no 
change" centred upon zero difference. This approach may be 
unrealistic because of response biases and regression to the mean, 
and is inappropriate if patients are likely to rate improvements in 
their health differently from how they would rate deterioration 
with the same condition. The method proposed by Redelmeier 
and colleagues, where other participants act as the anchor, avoids 
recall bias because all data can be collected at the same time, 
though it may not be a universally appropriate method, as 
participants might find it difficult to discuss particularly sensitive or 
private health issues with others. 

Distribution Method 

Three distinct distribution approaches were found [38-56]: 
measurement error, statistical test, and rule of thumb. The 
measurement error approach determines a value that is larger 
than the inherent imprecision in the measurement and that is 
therefore likely to be consistently noticed by patients. The most 
common approach for determining this value was based upon the 
standard error of measurement (SEM). The SEM can be defined 
in various ways, with different multiplicative factors suggested as 
signifying a non-trivial (important) difference. 

The most commonly used alternative to the SEM method 
(although it can be thought of as an extension of this approach) 
was the reliable change index proposed by Jacobson and Truax 
[47], which incorporates confidence around the measurement 
error. For the statistical test approach, a "minimal detectable 
difference" — the smallest difference that could be statistically 
detected for a given sample size — is calculated. This is then used as 
a guide for interpreting the presence of an "important" difference 
in this study. The rule-of-thumb approach defines an important 
difference based on the distribution of the outcome, such as using a 
substantial fraction of the possible range without further justifica- 
tion (e.g., 10 mm on a 100-mm visual analogue scale measuring 
symptom severity being viewed as a substantial shift in outcome 
response) [54]. 
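
The measurement error approach can be sketched with one common formulation. The text notes that the SEM can be defined in various ways; the version below, SEM = SD × sqrt(1 − reliability), and the reliable change index computed as the observed change divided by sqrt(2) × SEM, are assumptions chosen for illustration, as are the function names and the example values.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement, assuming SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def reliable_change_index(before, after, sd, reliability):
    """Observed change divided by the standard error of a difference
    between two scores, taken here as sqrt(2) * SEM."""
    s_diff = math.sqrt(2.0) * sem(sd, reliability)
    return (after - before) / s_diff


# Hypothetical instrument: baseline SD of 10 points, test-retest reliability 0.84.
print(round(sem(10.0, 0.84), 2))
print(round(reliable_change_index(50.0, 62.0, 10.0, 0.84), 2))
```

An index whose absolute value exceeds about 1.96 is conventionally read as a change unlikely to be explained by measurement error alone; that cutoff is likewise a convention rather than a finding of this review.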

Measurement error and rule-of-thumb approaches are widely 
used, but do not translate straightforwardly to an RCT target 
difference. This is because for measurement error approaches, 
assessment is typically based on test-retest (within-person) data, 
whereas many trials are of parallel group (between-person) design. 
Additionally, measurement error is not suitable as the sole basis for 
determining the importance of a particular target difference. More 
generally, the setting and timing of data collection may also be 
important to the calculation of measurement error (e.g., results 
may vary between pre- and post-treatment) [52]. The statistical 
test approach cannot be used to specify a priori a target difference 
in an RCT sample size calculation, as the observed precision of the 
statistical test is conditional on the sample size. Rule-of-thumb 
approaches are dependent upon the outcome having inherent 
value (e.g., Glasgow coma scale), where a substantial fraction of a 
unit change (e.g., one-third or one-half) can be viewed as 
important. 

Health Economic Method 

The approaches included under the health economic method 
typically involve defining a threshold value for the cost of a unit of 
health effect that a decision-maker is willing to pay and using this 
threshold to construct a "net benefit" that combines both resource 
cost and health outcomes [57-65]. The extent to which data on 
the differences in costs, benefits, and harms are used depends on 
the decision and perspective adopted (e.g., treatment x is better 
than treatment y when the net benefit for x is greater than that for 
y, i.e., the incremental net benefit for x compared to y is positive) 
[62]. The net benefit approach can be extended into a decision- 
theoretic model in order to undertake a value of information 
analysis [60,61,65], which seeks to address the value of removing 
the current uncertainty regarding the choice of treatment. The 
optimal sample size of a new study given the current evidence and 
the decision faced can be calculated. The perspective of the 
decision-making is critical, i.e., whether it is from the standpoint of 
clinicians, patients, funders, policy-makers, or some combination. 
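
The net benefit comparison described above can be written out as a short sketch. It assumes the usual form, net benefit = threshold × health effect − cost; the function names, the use of quality-adjusted life years as the unit of health effect, and all numerical values are hypothetical.

```python
def net_benefit(effect, cost, threshold):
    """Monetary net benefit: threshold * health effect - resource cost."""
    return threshold * effect - cost


def incremental_net_benefit(effect_x, cost_x, effect_y, cost_y, threshold):
    """Net benefit of treatment x minus net benefit of treatment y."""
    return (net_benefit(effect_x, cost_x, threshold)
            - net_benefit(effect_y, cost_y, threshold))


# Hypothetical values: x gives 0.70 units of health effect at a cost of 12,000;
# y gives 0.62 units at a cost of 10,000; the decision-maker is willing to pay
# 30,000 per unit of health effect.
inb = incremental_net_benefit(0.70, 12_000, 0.62, 10_000, 30_000)
print(round(inb, 2), "x preferred" if inb > 0 else "y preferred")
```

Changing the threshold can reverse the decision, which is why the perspective of the decision-maker matters so much for this method.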

More sophisticated modelling approaches can potentially allow 
a comprehensive evaluation of the treatment decision and the 
potential value of a new study, though they require strong 
assumptions about, for example, different measurements of 
effectiveness, harms, uptake, adherence, costs of interventions, 
and the cost of new research. The increased complexity, along 
with the gap between the input requirements of the more 
sophisticated modelling approaches and the data that are typically 
available, and the need to be explicit about the basis of synthesis of 
all the evidence upfront, perhaps explains the limited use of these 
modelling approaches in practice to date. 

Opinion-Seeking Method 

The opinion-seeking method determines a value (or a plausible 
range of values) for the target difference, by asking one or more 
individuals to state their view on what value or values for a 
particular difference should be important and/or realistic [66-86]. 
The identified studies varied widely in whose opinion was sought 
(e.g., patients, clinicians, or trialists), the method of selecting 
individual experts (e.g., literature search, mailing list, or confer- 
ence attendance), and the number of experts consulted. Other 
variations included the method used to elicit values (e.g., interview 
or survey), the complexity of the data elicited, and the method 
used to consolidate results into an overall value or range of values 
for the difference. 

One advantage of the opinion-seeking method is the ease with 
which it can be carried out (e.g., through a survey). However, 
estimates will vary according to the specified population. 
Additionally, different perspectives (e.g., patient versus health 
professional) may lead to very different estimates of what is 
important and/or realistic [73]. Also, the views of approached 
individuals may not necessarily be representative of the wider 
community. Furthermore, some methods for eliciting opinions 
have feasibility constraints (e.g., face-to-face methods), but 
alternative approaches for capturing the views of a larger number 
of experts require careful planning or may be subject to low 
response rates or partial responses [77]. 

Pilot Study Method 

A small number of studies used a pilot study method to 
determine a relevant value for the target difference [87-90]. A 
pilot study can be defined as running the intended study in 
miniature prior to conducting the actual trial, to guide expecta- 
tions on an appropriate value for the target difference. The 
simplest approach is to use the observed effect in the pilot study as 
the target difference in an RCT. More sophisticated approaches 
account for imprecision in the estimate from the pilot study and/or 
use the pilot study to estimate only the standard deviation (SD) 
(or control group event proportion) and not the target difference. 
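
The imprecision of a pilot estimate can be made concrete with a minimal sketch. It assumes a two-arm pilot with equal arm sizes, a continuous outcome with known SD, and a normal-approximation confidence interval; the function name and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def pilot_ci(observed_diff, sd, n_per_arm, level=0.95):
    """Normal-approximation confidence interval for a difference in means
    observed in a two-arm pilot study with equal arm sizes."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = sd * math.sqrt(2.0 / n_per_arm)
    return observed_diff - z * se, observed_diff + z * se


# Hypothetical pilot: observed difference of 5 points, SD 20, 15 per arm.
low, high = pilot_ci(5.0, 20.0, 15)
print(round(low, 1), round(high, 1))
```

In this example the interval spans zero and extends well beyond the observed effect, so taking the point estimate directly as the target difference would be fragile.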

However, there are practical difficulties in conducting a pilot 
study that may limit the relevance of results [87], most notably the 
inherent uncertainty in results due to the small study sample size, 
rendering the effect size imprecise and unreliable. Additionally, a 
pilot study can address only a realistic difference and does not 
inform what an important difference would be. Finally, it is worth 
noting that an internal pilot study, using the initial recruits within a 
larger study, cannot be used to pre-specify the target difference, 
though it could inform an adaptive update [90]. Notwithstanding 
the above critique, a pilot study can have a valuable role in 
addressing feasibility issues (e.g., recruitment challenges) that may 
need to be considered in a larger trial [89]. Pilot studies are most 
useful when they can be readily and quickly conducted. While few 
studies addressed using a pilot study to inform the specification of 
the target difference, trialists may use pilot studies to help 
determine the target difference without reporting this formally in 
trial reports. 

Review of the Evidence Base Method 

Implementation of the review of the evidence base method 
varied regarding what studies and results were considered as part 
of the review and how the findings of different studies were 
combined [91-103]. The most common approach involved 
implementing a pre-specified strategy for reviewing the evidence 
base for either a particular instrument or variety of instruments to 
identify an important difference. Alternatively, pre-existing studies 
for a specific research question may be used (e.g., using the pooled 
estimate of a meta-analysis) to determine the target difference 
[100]. Extending this general approach, Sutton and colleagues 
[101] derived a distribution for the effect of treatment from the 
meta-analysis, from which they then simulated the effect of a 
"new" study; the result of this study was added to the existing 
meta-analysis data, which were then re-analysed. Implicitly this 
adopts a realistic difference as the basis for the target difference. 
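
Using the pooled estimate of a meta-analysis as a realistic target difference can be sketched as follows. The example assumes a fixed-effect, inverse-variance model, which is only one of several pooling approaches and is not the simulation method of Sutton and colleagues; the function name and the study estimates are hypothetical.

```python
import math


def pooled_estimate(estimates, standard_errors):
    """Fixed-effect inverse-variance pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Hypothetical mean differences from three earlier trials.
estimates = [4.0, 6.0, 5.0]
standard_errors = [2.0, 1.0, 2.0]

pooled, pooled_se = pooled_estimate(estimates, standard_errors)
print(round(pooled, 2), round(pooled_se, 2))
```

The standard error of the pooled estimate is a reminder that the candidate target difference is itself uncertain.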

Reviewing the existing evidence base is valuable as it provides a 
rationale for choosing an important and/or realistic target 
difference. It is likely that this general approach is often informally 
used, though few have addressed how it should be formally done. 
However, estimates identified from existing evidence may not 
necessarily be appropriate for the population being considered for 
the trial, so the generalisability of the available studies and 
susceptibility to bias should be considered. For reviews of studies 
that identified an important difference, the methods used in each 
of the individual studies to determine that difference are subject to 
the practical issues mentioned here for that method (e.g., the 
anchor method). Imprecision of the estimate is also an important 
consideration, and publication bias may also be an issue if reviews 
of the evidence base consider only published data. If a meta- 
analysis of previous results is used to determine a sample size, then 
additional evidence published after the search used in the meta- 
analysis was conducted may necessitate updating the sample size. 

Standardised Effect Size Method 

This method is commonly used to determine the importance of 
a difference in an outcome when set in comparison to other 
possible effect sizes upon a standardised scale [88,104-116]. 
Overwhelmingly, studies used the guidelines suggested by Cohen 
[106] for the Cohen's d metric, i.e., 0.2, 0.5, and 0.8 for small, 
medium, and large effects, respectively, in the context of a 
continuous outcome. Other SES metrics exist for continuous (e.g., 
Dunlap's d), binary (e.g., odds ratio), and survival (hazard ratio) 
outcomes [106,111,116]. Most of the literature relates to within- 
group SESs for a continuous outcome. The SD used should reflect 
the anticipated RCT population as far as possible. 

The main benefit of using a SES method is that it can be readily 
calculated and compared across different outcomes, conditions, 
studies, settings, and people; all differences are translated into a 
common metric. It is also easy to calculate the SES from existing 
evidence if studies have reported sufficient information. The 
Cohen guidelines for small, medium, and large effects can be 
converted into equivalent values for other binary metrics (e.g., 
1.44, 2.48, and 4.27, respectively, for odds ratio) [105]. As noted 
above, SES metrics are commonly used for binary (e.g., odds ratio 
or risk ratio) and survival outcomes (e.g., hazard ratio) in medical 
research [111], and a similar approach can be readily adopted for 
such outcomes. However, no equivalent guideline values are in 
widespread use. Informally, a doubling or halving of a ratio is 
sometimes seen as a marker of a large relative effect [109]. 
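
As a brief illustration, the odds ratio values quoted above can be reproduced from Cohen's thresholds. The sketch below assumes the logistic-distribution conversion ln(OR) = d × π/√3; the text does not state which conversion was used, so this formula is an assumption, although it does return the quoted values. The function names are hypothetical.

```python
import math

# Cohen's conventional thresholds for d, as quoted in the text.
COHEN_D = {"small": 0.2, "medium": 0.5, "large": 0.8}


def cohens_d(mean_difference, sd):
    """Standardised effect size: the difference in means in SD units."""
    return mean_difference / sd


def d_to_odds_ratio(d):
    """Assumed logistic conversion: ln(OR) = d * pi / sqrt(3)."""
    return math.exp(d * math.pi / math.sqrt(3))


for label, d in COHEN_D.items():
    print(f"{label}: d = {d}, odds ratio ~ {d_to_odds_ratio(d):.2f}")
```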

It is important to note that SES values are not uniquely defined, 
and different combinations of values on the original scale can 
produce the same SES value. For the standard Cohen's d statistic, 
different combinations of mean and SD values produce the same 
SES estimate. For example, mean differences (SD) of 5 (10) and 2 (4) both 
give a standardised effect of 0.5 SD. As a consequence, specifying 
the target difference as a SES alone, though sufficient in terms of 
sample size calculation, can be viewed as insufficient in that it does 
not actually define the target difference for the outcome measure 
of interest. A limitation of the SES is the difficulty in determining 
why different effect sizes are seen in different studies: for example, 
whether these differences are due to differences in the outcome 
measure, intervention, settings, or participants in the studies, or 
study methodology. 
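
The point that a SES is sufficient for the sample size calculation, yet does not pin down the difference on the original scale, can be shown with a short sketch. It assumes a two-arm parallel-group trial with equal allocation, a two-sided test, and the usual normal approximation; the function name and default values are illustrative assumptions rather than anything prescribed by the review.

```python
import math
from statistics import NormalDist


def sample_size_per_group(mean_difference, sd, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-arm trial."""
    d = mean_difference / sd  # standardised effect size
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


# Two different targets on the original scale, same SES of 0.5.
print(sample_size_per_group(5, 10))  # 63
print(sample_size_per_group(2, 4))   # 63
```

Both calls return the same number because only the ratio of the mean difference to the SD enters the formula. A calculation based on the t distribution would give a slightly larger value.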

Combining Methods 

The vast majority of studies that combined methods used two or 
three of the anchor, distribution, and SES methods. Studies that 
used multiple methods were not always clear in describing whether 
and how results were triangulated, and for certain combinations 
the result of one method seemed to be considered of greater value 
than the result of another method (i.e., as if a primary and 
supplementary method had been selected). For example, values 
that were found using the anchor method were often chosen over 
effect size results or distribution-based estimates [117]. Alterna- 
tively, the most conservative value was chosen, regardless of the 
comparative robustness of the methods used [118]. In cases where 
the results of the different methods were similar, triangulation of 
the results was straightforward [119]. 

Discussion 

This comprehensive systematic review summarizes approaches 
for specifying the target difference in a RCT sample size 
calculation. Of the seven identified methods, the anchor, 
distribution, and SES methods were most widely used. There 
are several reasons for the popularity of these methods, including 
ease of use, usefulness in studies validating quality of life 
instruments, and simplicity of calculation of distribution and 
SES estimates alongside the anchor method. While most studies 
adopted (though typically implicitly) the conventional Neyman- 
Pearson statistical framework, some of the methods (i.e., health 
economic and opinion-seeking) particularly suit a Bayesian 
framework. 

No further methods were identified by this review beyond the 
seven methods pre-identified from a scoping search. However, 
substantial variations in implementation were noted, even for 
relatively simple approaches such as the anchor method, and 
many studies used multiple methods. Most studies focused on 
continuous outcomes, although other outcome types were 
considered using opinion-seeking and evidence base review. While 
the methods could in principle be used for any type of RCT, they 
are most relevant to the design of Phase 3, or "definitive", trials. 

A number of key issues were common across the methods. First, 
it is critical to decide whether the focus is to determine an 
important and/or a realistic difference. Some methods can be used 
for both (e.g., opinion-seeking), and some for only one or the other 
(e.g., the anchor method to determine an important difference and 
the pilot study method to determine a realistic difference). 
Evaluating how the difference was determined and the context 
of determining the target difference is important. Some approach- 
es commonly used for specifying an important difference either 
cannot be used for specifying a target difference (such as the 
statistical test approach) or do not straightforwardly translate into 
the typical RCT context (for example the measurement error 
approach). The anchor, opinion-seeking, and health economic 
methods explicitly involve judgment, and the perspective taken in 
the study is a key consideration regarding their use. As a 
consequence, these methods explicitly allow different perspectives 
to be considered, and in particular enable the views of patients and 
the public to be part of the decision-making process. 

Some methodological issues are specific to particular methods. 
For example, the necessity of choosing a cutoff point to define an 
"important" difference/change is specific to the anchor method. 
This approach is a widely recognised part of the validation process 
for new quality of life instruments, where the scale has no inherent 
meaning without reference to an outside marker (i.e., anchor). 

All three approaches of the distribution method (measurement 
error, statistical test, and rule of thumb) have clear limitations, 
the foremost being that they do not match the setting of a standard 
RCT design (two parallel groups). The statistical test approach 
cannot be used to specify a target difference, given that it is 
essentially a rearranged sample size formula. The rule-of-thumb 
approach is dependent upon the interpretability of the individual 
scale. 

The SES method was used in a substantial number of studies for 
a continuous outcome, but was rarely reported for non-continuous 
outcomes, despite informal use of such an approach probably 
being widespread. No parallel for a binary outcome exists, though 
odds ratio values approximately equivalent to Cohen's d values 
can be used. The validity of Cohen's cutoffs is uncertain (despite 
widespread usage), and some modifications to the original values 
have been proposed [120,121]. 

The opinion-seeking method was often used with multiple 
strategies involved in the process (e.g., questionnaires being sent to 
experts using particular sampling methods, followed by an 
additional conference being organised to discuss findings in more 
detail). The Delphi technique for survey development and the 
nominal group technique for face-to-face meetings are commonly 
used and are potentially useful for this type of research when 
developing instruments. In terms of planning a trial, the opinion- 
seeking method can be relatively easy to implement, but the 
resulting usefulness of the estimated target difference may depend 
on the robustness of the approach used to elicit opinions. 

The health economic and pilot study methods were infrequently 
reported as specific methods. For the health economic method, 
this is likely due to the complexity of the method and/or the 
resource-intensive procedures that are required to conduct the 
theoretically more robust variants that have been developed. The 
use of pilot studies to determine the target difference is 
problematic and probably only useful for the control group event 
proportion or SD, for a binary or continuous outcome, 
respectively. Internal pilot studies may be incorporated into the 
start of larger clinical trials, but are not useful for specifying the 
target difference, though they could be used to revise the sample 
size calculation. The review of the evidence base method can be 
applied to identify either an important or a realistic difference; a pilot 
study addresses only a realistic difference. For both methods, 
applicability to the anticipated study and the impact of statistical 
uncertainty on estimates should be considered. 

A review of the evidence base approach for a particular 
outcome measurement or study population may be combined with 
any of the other methods identified for establishing an important 
difference. However, the number of studies reporting a formal 
method for identifying an important difference using the existing 
evidence was surprisingly small. It could be that there is wide 
variation in the extent to which reviews of the existing evidence 
base have been undertaken prospectively using a specific and 
formal strategy. 

Some methods can be readily used with others, potentially 
increasing the robustness of their findings. The anchor and 
distribution methods were often used together within the same 
study, frequently also with the SES approach. Multiple methods 
for specifying an important difference were used in some studies, 
though the combinations varied, as did the extent to which results 
were triangulated. The result of one method may validate the 
result found using another method, but conflicting estimates 
increase uncertainty over the estimate of an important difference. 

Strengths and Limitations 

To our knowledge, this review is the first comprehensive and 
systematic search of all possible methods for specifying a target 
difference. The search strategy was inclusive, robust, and logical; 
however, this led to a large number of studies that did not report a 
method for specifying an important and/or realistic difference. 
Also, it is possible some studies were missed because of the lack of 
standardised terminology. Finally, our search period ended in 
January 2011, and another method not included in the seven 
identified by this review may have been published since then. 

Editors' Summary 

Background. A clinical trial is a research study in which 
human volunteers are randomized to receive a given 
intervention or not, and outcomes are measured in both 
groups to determine the effect of the intervention. Random- 
ized controlled trials (RCTs) are widely accepted as the 
preferred study design because by randomly assigning 
participants to groups, any differences between the two 
groups, other than the intervention under study, are due to 
chance. To conduct a RCT, investigators calculate how many 
patients they need to enroll to determine whether the 
intervention is effective. The number of patients they need to 
enroll depends on how effective the intervention is expected 
to be, or would need to be in order to be clinically important. 
The assumed difference between the two groups is the target 
difference. A larger target difference generally means that 
fewer patients need to be enrolled, relative to a smaller target 
difference. The target difference and number of patients 
enrolled contribute to the study's statistical precision, and the 
ability of the study to determine whether the intervention 
is effective. Selecting an appropriate target difference is 
important from both a scientific and ethical standpoint. 

Why Was This Study Done? There are several ways to 
determine an appropriate target difference. The authors 
wanted to determine what methods for specifying the target 
difference are available and when they can be used. 

What Did the Researchers Do and Find? To identify 
studies that used a method for determining an important 
and/or realistic difference, the investigators systematically 
surveyed the research literature. Two reviewers screened each 
of the abstracts chosen, and a third reviewer was consulted if 
necessary. The authors identified seven methods to determine 
target differences. They evaluated the studies to establish 
similarities and differences of each application. Points about 
the strengths and limitations of the method and how 
frequently the method was chosen were also noted. 

What Do These Findings Mean? The study draws 
attention to an understudied but important part of design- 
ing a clinical trial. Enrolling the right number of patients is 
very important — too few patients and the study may not be 
able to answer the study question; too many and the study 
will be more expensive and more difficult to conduct, and 
will unnecessarily expose more patients to any study risks. 
The target difference may also be helpful in interpreting the 
results of the trial. The authors discuss the pros and cons of 
different ways to calculate target differences and which 
methods are best for which types of studies, to help inform 
researchers designing such studies. 

Additional Information. Please access these websites via 
the online version of this summary at 
http://dx.doi.org/10.1371/journal.pmed.1001645. 

• Wikipedia has an entry on sample size determination that 
discusses the factors that influence sample size calculation, 
including the target difference and the statistical power of 
a study (statistical power is the ability of a study to find a 
difference between treatments when a true difference 
exists). (Note: Wikipedia is a free online encyclopedia that 
anyone can edit; available in several languages.) 

• The University of Ottawa has an article that explains how 
different factors influence the power of a study 